Reviews: Recovering Bandits
The major point that needs clarification for me is the distinction between single-play regret and multiple-play regret. The UCB-Z algorithm, for example, is not of this type, and it is a bit odd to have a regret notion that only applies to very specific algorithms. Second, regarding the distinction between single- and multiple-play regret: am I right that $\mathbb{E}[R_T^{(d,m)}] \ge \mathbb{E}[R_T^{(d)}]$? That is, one would in principle care about the multiple-play lookahead regret, since one should be allowed to play an arm multiple times during the $d$ time steps over which we optimize. From my understanding, the single-play notion would be defined only for strategies that select each arm at most once during the $d$ steps (and therefore $d$ cannot be larger than $K$ in this case, right?).
Dynamic Planning and Learning under Recovering Rewards
Simchi-Levi, David, Zheng, Zeyu, Zhu, Feng
Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce a general class of multi-armed bandit problems that have the following two features: (i) the decision maker can pull and collect rewards from at most $K$ out of $N$ different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then nonparametrically recovers as the idle time increases. With the objective of maximizing expected cumulative rewards over $T$ time periods, we propose, construct and prove performance guarantees for a class of "Purely Periodic Policies". For the offline problem when all model parameters are known, our proposed policy obtains an approximation ratio on the order of $1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal as $K$ grows to infinity. For the online problem when the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that achieves approximately $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design may be adapted to other offline planning and online learning applications with non-stationary and recovering rewards.
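The model in this abstract (pull at most $K$ of $N$ arms per period; rewards drop on a pull and recover with idle time; purely periodic schedules) can be sketched in a few lines. Note the exponential recovery curves, the specific periods and offsets, and the greedy baseline below are all illustrative assumptions for the sketch, not the authors' construction:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, T = 6, 2, 600  # arms, pulls per period, horizon (illustrative)

# Hypothetical recovery curves: expected reward grows with idle time z and
# saturates. The paper treats these nonparametrically; we pick a parametric
# family purely for brevity.
caps = rng.uniform(0.5, 1.0, size=N)
rates = rng.uniform(0.1, 0.5, size=N)

def expected_reward(arm, idle):
    """Expected reward of `arm` after `idle` periods since its last pull."""
    return caps[arm] * (1.0 - np.exp(-rates[arm] * idle))

def run_schedule(periods, offsets):
    """Purely periodic policy: pull arm i whenever t % periods[i] == offsets[i],
    truncated to at most K pulls per period."""
    idle = np.zeros(N)
    total = 0.0
    for t in range(T):
        due = [i for i in range(N) if t % periods[i] == offsets[i]]
        chosen = due[:K]  # respect the K-pulls-per-period budget
        for i in range(N):
            if i in chosen:
                total += expected_reward(i, idle[i])
                idle[i] = 0  # reward collected; recovery restarts
            else:
                idle[i] += 1
    return total

# Staggered periodic schedule: every arm rests 3 periods between pulls.
periodic = run_schedule(periods=[3] * N, offsets=[0, 0, 1, 1, 2, 2])
# Degenerate baseline: hammer the same K arms every period (they never recover).
greedy = run_schedule(periods=[1] * N, offsets=[0] * N)
```

Staggering the offsets so exactly $K$ arms come due each period is what lets a periodic schedule give every arm idle time to recover; the degenerate baseline collects nothing because an arm pulled every period never leaves its post-pull trough.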
Recovering Bandits
Pike-Burke, Ciara, Grünewälder, Steffen
We study the recovering bandits problem, a variant of the stochastic multi-armed bandit problem where the expected reward of each arm varies according to some unknown function of the time since the arm was last played. While a natural extension of the classical bandit problem that arises in many real-world settings, this variant is accompanied by significant difficulties. In particular, methods need to plan ahead and estimate many more quantities than in the classical bandit setting. In this work, we explore the use of Gaussian processes to tackle the estimation and planning problems. We also discuss different regret definitions that let us quantify the performance of the methods. To improve the computational efficiency of the methods, we provide an optimistic planning approximation. We complement these discussions with regret bounds and empirical studies.
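The "plan ahead" requirement in this abstract can be made concrete with a brute-force $d$-step lookahead over toy recovery functions. The arm names and curves below are hypothetical, and the paper's Gaussian-process estimation step is omitted (the functions are simply assumed known), so this only sketches the planning component:

```python
from itertools import product

# Toy recovery functions of idle time z (assumed known here; the paper
# estimates them with Gaussian processes).
recovery = {
    "a": lambda z: min(1.0, 0.2 * z),  # slow recovery, high ceiling
    "b": lambda z: min(0.4, 0.4 * z),  # fast recovery, low ceiling
}

def lookahead(idle, d):
    """Brute-force d-step lookahead: best arm sequence and its total reward,
    given `idle` mapping each arm to the time since it was last played."""
    best_seq, best_val = None, float("-inf")
    for seq in product(recovery, repeat=d):
        z = dict(idle)  # simulate this candidate sequence
        val = 0.0
        for arm in seq:
            val += recovery[arm](z[arm])
            for a in z:  # the played arm resets; the others keep recovering
                z[a] = 0 if a == arm else z[a] + 1
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq, best_val

best_seq, best_val = lookahead({"a": 3, "b": 3}, d=2)
```

With both arms idle for 3 steps, the best 2-step plan plays the fast-recovering arm first so the slow, high-ceiling arm keeps recovering, which a myopic rule would miss. The enumeration grows exponentially in $d$, which is exactly the cost the abstract's optimistic planning approximation is meant to avoid.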